# State Public Health Agency, CDC, and FDA Tweets (2012 through 2022)

A data set of IDs of Tweets from state public health agencies, CDC/FDA Twitter accounts, and state-level COVID-19 communication accounts from 2012 through 2022.  Downloaded via the R package "academictwitteR" on April 9, 2023. Uploaded to Harvard Dataverse on April 9, 2023.

Author: Samuel R. Mendez, Harvard T.H. Chan School of Public Health.
Date: April 9, 2023.

## Overview
I created this data set as part of my dissertation work as a PhD candidate in the Department of Social and Behavioral Sciences at the Harvard T.H. Chan School of Public Health. My dissertation is focused on expanding the scale and scope of health literacy research methods via media studies theory and natural language processing techniques.

Tweets from this data set will enable two streams of work in this dissertation. One is focused on testing the feasibility of supervised machine learning to predict human scoring of state public health agency Tweets via the CDC Clear Communication Index. The other is focused on integrating existing health literacy and natural language processing techniques to analyze public health communication on Twitter at its actual scale of production.

Accomplishing these aims will require only a subset of this data set, but this seemed like an opportunity to help enable more health communication research. The initial construction of the queries and exploration of the data set were time-consuming, but it was easy to expand the time scope of the data set once I had the scripts written. As such, I wanted to share this data set, as well as the manually gathered governmental agency Twitter handles and the R script required to create it. In addition to increasing my dissertation's reproducibility, I am also sharing these files so others can use them in their own work. See the License section below for more information.

## Folder structure
The files in this Dataverse entry are all in one directory. I created "twitterHandles.xlsx" manually. Then I used the script in "create_dfAllTweets_sharable.Rmd" to create all the others. (Note that Harvard Dataverse automatically creates ".tab" versions of the CSV files below as part of its archival process.) I recommend you start with those two files to better understand the goals and limitations of downloading this Twitter data set. All files are listed alphabetically and described below:
* __create_dfAllTweets_sharable.html__ is the knit version of "create_dfAllTweets_sharable.Rmd" (April 9, 2023). Note that the warning messages about existing directories that appear when carrying out API calls are there because I knit the Rmd file immediately after running all the code chunks in it. As written, the scripts using the academictwitteR package will create new directories, assuming directories with those names do not already exist.
* __create_dfAllTweets_sharable.Rmd__ is the RMarkdown file used to create the data set. It uses the package "academictwitteR" to download state public health agency Tweets and CDC/FDA Tweets from 2012 through 2022 using Twitter's Academic Research API (v2). Finalized and knit to HTML on April 9, 2023. This is the "sharable" version of the file I used, which does not include code that relied on hard-coded associations between author IDs and states, so as to comply with Twitter policy.
* __dataBiography.html__ summarizes the contextual factors that shaped the production of this data set.
* __dataBiography.txt__ summarizes the contextual factors that shaped the production of this data set. It is in markdown format.
* __dataDictionary.csv__ provides the definitions of the variables in the RData and CSV versions of df_allTweets_sharable.
* __df_allTweets_sharable.csv__ is the sharable CSV version of the df_all_Tweets object created in the RMarkdown file associated with this Dataverse upload. It contains one column of Tweet IDs, as in publicHealthTweet_IDs.txt. It also contains a custom "state" column to designate the area served or the federal agency affiliation of the governmental agency author.
* __df_allTweets_sharable.Rdata__ is the sharable RData version of the df_all_Tweets object created in the RMarkdown file associated with this Dataverse upload. It contains one column of Tweet IDs, as in publicHealthTweet_IDs.txt. It also contains a custom "state" column to designate the area served or the federal agency affiliation of the governmental agency author. I created it using the saveRDS() function, so I recommend loading it into R with readRDS() rather than load().
* __publicHealthTweet_IDs.txt__ is the primary file in this Dataverse upload. It contains Tweet IDs for Tweets by state public health agencies, state-level pandemic response efforts, and select CDC- or FDA-associated Twitter accounts, from 2012 through 2022 (n=690281). It is ready to hydrate: each ID is on its own line, with no extraneous characters. The file ends with a blank new line. Created on April 9, 2023, via the academictwitteR R package to access the Twitter API for Academic Research (v2).
* __twitterHandles.xlsx__ is a manually created list of Twitter handles, verification status, and either state served or federal agency affiliation for public health agency accounts. Created during the summer of 2022, and last updated on April 8, 2023. I used the handle/state associations in this file to make Twitter API calls and create a custom "state" variable in this data set. This file comes with its own data dictionary.
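
As a rough illustration of what "ready to hydrate" means for publicHealthTweet_IDs.txt, here is a minimal Python sketch (not part of the original R workflow) that reads the file, checks that every non-blank line contains only digits, and splits the IDs into batches. The function names `read_tweet_ids` and `batch` are hypothetical, and the default batch size of 100 is an assumption based on the per-request ID limit of Twitter's v2 Tweet lookup endpoint; adjust it to whatever your hydration tool expects. IDs are kept as strings because they are too long to store safely as floating-point numbers.

```python
from pathlib import Path


def read_tweet_ids(path):
    """Read one Tweet ID per line, keeping each ID as a string."""
    ids = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_number, line in enumerate(lines, start=1):
        tweet_id = line.strip()
        if not tweet_id:
            continue  # tolerate the blank line at the end of the file
        if not (tweet_id.isascii() and tweet_id.isdigit()):
            raise ValueError(
                f"line {line_number}: unexpected content {tweet_id!r}"
            )
        ids.append(tweet_id)
    return ids


def batch(ids, size=100):
    """Split the IDs into consecutive batches of at most `size` items.

    The default of 100 is an assumption, not something the data set requires.
    """
    return [ids[i:i + size] for i in range(0, len(ids), size)]


if __name__ == "__main__":
    tweet_ids = read_tweet_ids("publicHealthTweet_IDs.txt")
    print(len(tweet_ids), "IDs in", len(batch(tweet_ids)), "batches")
```

Each batch can then be passed to the hydration tool of your choice. If the reported count differs from the n given above, the download may be incomplete.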

## License
This README and the above files are shared under a CC0 1.0 Universal Public Domain Dedication. In short, this means you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. For full details, see the [CC0 1.0 license page](https://creativecommons.org/publicdomain/zero/1.0/).

Though not necessarily the norm in all use cases, I kindly ask anyone who uses this data set to cite the [Harvard Dataverse entry in which I originally published these files](https://doi.org/10.7910/DVN/VX4HK8).